zero_one_loss (classification error / 1 - accuracy)#
zero_one_loss is the simplest classification loss: it counts how often the predicted label differs from the true label.
It is a great evaluation metric for “did we get the label right?”, but a poor training objective for gradient-based optimization because it is discontinuous / non-differentiable.
Learning goals#
write the binary and multiclass definitions in clean notation
understand the link to accuracy and (for binary) the confusion matrix
implement zero_one_loss in NumPy (with optional sample_weight)
build intuition via threshold and parameter-surface plots (Plotly)
see how 0-1 loss is used for selection/optimization in practice (threshold tuning)
Quick import#
from sklearn.metrics import zero_one_loss
Table of contents#
Definition and notation
Intuition: thresholds and decision rules (plots)
NumPy implementation + sanity checks
Using 0-1 loss for selection/optimization
Pros, cons, pitfalls
References (quick)#
scikit-learn docs: https://scikit-learn.org/stable/api/sklearn.metrics.html
ESL (Hastie, Tibshirani, Friedman): “The Elements of Statistical Learning” (classification + empirical risk)
import numpy as np
import plotly.graph_objects as go
import os
import plotly.io as pio
from plotly.subplots import make_subplots
from sklearn.datasets import make_blobs
from sklearn.metrics import accuracy_score, zero_one_loss as sk_zero_one_loss
from sklearn.model_selection import train_test_split
pio.templates.default = "plotly_white"
pio.renderers.default = os.environ.get("PLOTLY_RENDERER", "notebook")
np.set_printoptions(precision=4, suppress=True)
rng = np.random.default_rng(0)
1) Definition and notation#
Assume we have \(n\) examples.
True label: \(y_i\)
Predicted label: \(\hat{y}_i\)
Per-example 0-1 loss#
\[
\ell(y_i, \hat{y}_i) = \mathbf{1}[y_i \neq \hat{y}_i] =
\begin{cases}
0 & \text{if } y_i = \hat{y}_i \\
1 & \text{otherwise}
\end{cases}
\]
Aggregate (count vs mean)#
Unnormalized (count of mistakes):
\[
L_{0\text{-}1} = \sum_{i=1}^{n} \mathbf{1}[y_i \neq \hat{y}_i]
\]
Normalized (fraction of mistakes):
\[
\bar{L}_{0\text{-}1} = \frac{1}{n} \sum_{i=1}^{n} \mathbf{1}[y_i \neq \hat{y}_i]
\]
This is exactly \(1 - \text{accuracy}\).
Sample-weighted version#
Given weights \(w_i \ge 0\) (e.g. importance weights, class weights), the normalized weighted 0-1 loss is:
\[
\bar{L}_{0\text{-}1}^{w} = \frac{\sum_{i=1}^{n} w_i \, \mathbf{1}[y_i \neq \hat{y}_i]}{\sum_{i=1}^{n} w_i}
\]
Multiclass and multilabel#
Multiclass (\(K\) classes): \(y_i \in \{0,\dots,K-1\}\) and the same formula applies.
Multilabel / multioutput: \(y_i\) is a vector. scikit-learn’s zero_one_loss uses the subset 0-1 loss:
\[
\bar{L}_{0\text{-}1} = \frac{1}{n} \sum_{i=1}^{n} \mathbf{1}[\mathbf{y}_i \neq \hat{\mathbf{y}}_i]
\]
i.e. the whole label vector must match exactly. (This is often stricter than what you want; see pitfalls.)
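To make that strictness concrete, here is a minimal NumPy sketch contrasting subset 0-1 loss with the milder Hamming loss (fraction of individual labels that are wrong) on the same predictions:

```python
import numpy as np

# Subset 0-1 vs Hamming loss on a tiny multilabel example (sketch).
y_true = np.array([[1, 0, 1], [1, 1, 0], [0, 0, 1]])
y_pred = np.array([[1, 0, 1], [1, 0, 0], [0, 1, 1]])

subset_01 = np.any(y_true != y_pred, axis=1).mean()  # whole row must match exactly
hamming = (y_true != y_pred).mean()                  # per-label error rate

print(subset_01)  # 2 of 3 rows have at least one wrong label -> 0.666...
print(hamming)    # only 2 of 9 individual labels are wrong -> 0.222...
```

Two nearly-correct rows count as fully wrong under subset 0-1, which is why Hamming loss is often the more informative multilabel metric.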
Bayes optimal decision rule (why argmax probability is optimal)#
Let the model output class probabilities \(p_k(x) = P(Y=k\mid X=x)\). The classifier that minimizes the expected 0-1 loss is:
\[
h^*(x) = \arg\max_{k} \, p_k(x)
\]
Binary case with \(\eta(x)=P(Y=1\mid X=x)\) and equal misclassification costs:
\[
h^*(x) = \mathbf{1}[\eta(x) \ge 0.5]
\]
With costs \(c_{01}\) (false positive) and \(c_{10}\) (false negative), the optimal decision rule becomes:
\[
\hat{y} = 1 \iff \eta(x) \ge \frac{c_{01}}{c_{01} + c_{10}}
\]
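As a quick sketch of the cost-sensitive rule (the cost values below are made up for illustration): when false negatives are 4x as costly as false positives, the decision threshold drops well below 0.5.

```python
import numpy as np

# Cost-sensitive threshold (sketch): predict 1 when eta(x) >= c01 / (c01 + c10).
c01 = 1.0  # cost of a false positive
c10 = 4.0  # cost of a false negative (missing a positive is 4x worse)
t_star = c01 / (c01 + c10)
print(t_star)  # 0.2 -> predict positive much more readily than at t=0.5

eta = np.array([0.15, 0.25, 0.60, 0.05])  # hypothetical P(y=1 | x) values
y_hat = (eta >= t_star).astype(int)
print(y_hat)  # [0 1 1 0]
```

With equal costs (\(c_{01} = c_{10}\)) the same formula recovers the familiar \(t = 0.5\).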
def sigmoid(z):
z = np.asarray(z, dtype=float)
return np.where(z >= 0, 1.0 / (1.0 + np.exp(-z)), np.exp(z) / (1.0 + np.exp(z)))
def zero_one_loss_np(y_true, y_pred, *, normalize=True, sample_weight=None):
"""NumPy implementation of scikit-learn's zero_one_loss.
- If y is 1D: counts elementwise mismatches.
- If y is 2D (multilabel / multioutput): uses subset 0-1 loss (row must match exactly).
"""
y_true = np.asarray(y_true)
y_pred = np.asarray(y_pred)
if y_true.shape != y_pred.shape:
raise ValueError(f"shape mismatch: y_true {y_true.shape} vs y_pred {y_pred.shape}")
if y_true.ndim not in (1, 2):
raise ValueError("y_true must be 1D or 2D")
if y_true.ndim == 1:
incorrect = (y_true != y_pred)
else:
incorrect = np.any(y_true != y_pred, axis=1)
incorrect = incorrect.astype(float)
n = incorrect.shape[0]
if sample_weight is None:
total = float(incorrect.sum())
return total / n if normalize else total
w = np.asarray(sample_weight, dtype=float)
if w.ndim != 1 or w.shape[0] != n:
raise ValueError(f"sample_weight must be shape (n,), got {w.shape}")
total = float(np.sum(w * incorrect))
if not normalize:
return total
w_sum = float(w.sum())
if w_sum == 0:
return 0.0
return total / w_sum
def predict_labels_from_proba(p, *, threshold=0.5):
"""Convert probabilities to hard labels.
- Binary: p is (n,) or (n,2) (assumes column 1 is P(y=1)).
- Multiclass: p is (n,K) -> argmax.
"""
p = np.asarray(p, dtype=float)
if p.ndim == 1:
return (p >= threshold).astype(int)
if p.ndim == 2 and p.shape[1] == 2:
return (p[:, 1] >= threshold).astype(int)
if p.ndim == 2:
return np.argmax(p, axis=1)
raise ValueError(f"p must be 1D or 2D, got shape {p.shape}")
def zero_one_loss_from_proba(
y_true,
p,
*,
threshold=0.5,
normalize=True,
sample_weight=None,
):
y_pred = predict_labels_from_proba(p, threshold=threshold)
return zero_one_loss_np(y_true, y_pred, normalize=normalize, sample_weight=sample_weight)
def log_loss_binary(y_true, p, *, sample_weight=None, eps=1e-15):
y_true = np.asarray(y_true, dtype=float)
p = np.asarray(p, dtype=float)
p = np.clip(p, eps, 1 - eps)
per_sample = -(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))
if sample_weight is None:
return float(per_sample.mean())
w = np.asarray(sample_weight, dtype=float)
w_sum = float(w.sum())
if w_sum == 0:
return 0.0
return float(np.sum(w * per_sample) / w_sum)
def best_threshold_zero_one(y_true, p, *, sample_weight=None, normalize=True):
"""Find an exact minimizer over thresholds t in [0, 1] (binary, rule: p>=t -> 1).
The predictions only change when t crosses a value in p, so evaluating t over unique p values
(plus the endpoints 0 and 1) is enough to find the exact optimum.
"""
y_true = np.asarray(y_true)
p = np.asarray(p, dtype=float)
if y_true.shape != p.shape or p.ndim != 1:
raise ValueError("y_true and p must be 1D arrays of the same shape")
if sample_weight is None:
w = np.ones_like(p, dtype=float)
else:
w = np.asarray(sample_weight, dtype=float)
if w.shape != p.shape:
raise ValueError("sample_weight must have the same shape as p")
order = np.argsort(p)
p_s = p[order]
y_s = y_true[order]
w_s = w[order]
w_pos = w_s * (y_s == 1)
w_neg = w_s * (y_s == 0)
cum_pos = np.cumsum(w_pos)
cum_neg = np.cumsum(w_neg)
total_neg = float(cum_neg[-1])
uniq = np.unique(p_s)
thresholds = np.unique(np.concatenate(([0.0], uniq, [1.0])))
start = np.searchsorted(p_s, thresholds, side="left")
before = start - 1
pos_below = np.where(before >= 0, cum_pos[before], 0.0)
neg_below = np.where(before >= 0, cum_neg[before], 0.0)
misclassified = pos_below + (total_neg - neg_below)
if normalize:
denom = float(w_s.sum())
losses = misclassified / denom if denom > 0 else np.zeros_like(misclassified)
else:
losses = misclassified
best_j = int(np.argmin(losses))
return float(thresholds[best_j]), float(losses[best_j])
def standardize_fit_transform(X):
X = np.asarray(X, dtype=float)
mean = X.mean(axis=0)
std = X.std(axis=0)
std = np.where(std == 0, 1.0, std)
return (X - mean) / std, mean, std
def standardize_transform(X, mean, std):
X = np.asarray(X, dtype=float)
std = np.where(std == 0, 1.0, std)
return (X - mean) / std
def fit_logistic_regression_gd(
X_train,
y_train,
X_val=None,
y_val=None,
*,
lr=0.2,
n_steps=300,
l2=0.0,
threshold=0.5,
):
X_train = np.asarray(X_train, dtype=float)
y_train = np.asarray(y_train, dtype=int)
n, d = X_train.shape
w = np.zeros(d, dtype=float)
b = 0.0
hist = {
"step": [],
"train_log_loss": [],
"train_zero_one": [],
"val_log_loss": [],
"val_zero_one": [],
}
for step in range(n_steps + 1):
z_train = X_train @ w + b
p_train = sigmoid(z_train)
hist["step"].append(step)
hist["train_log_loss"].append(log_loss_binary(y_train, p_train))
hist["train_zero_one"].append(zero_one_loss_from_proba(y_train, p_train, threshold=threshold))
if X_val is not None and y_val is not None:
z_val = np.asarray(X_val, dtype=float) @ w + b
p_val = sigmoid(z_val)
hist["val_log_loss"].append(log_loss_binary(y_val, p_val))
hist["val_zero_one"].append(zero_one_loss_from_proba(y_val, p_val, threshold=threshold))
else:
hist["val_log_loss"].append(np.nan)
hist["val_zero_one"].append(np.nan)
if step == n_steps:
break
# gradient of mean log loss (plus optional L2 penalty)
grad = p_train - y_train
grad_w = (X_train.T @ grad) / n + l2 * w
grad_b = float(grad.mean())
w -= lr * grad_w
b -= lr * grad_b
return w, b, hist
2) Intuition: thresholds and decision rules (plots)#
0-1 loss depends only on hard labels.
In binary classification, many models output a score or probability \(\hat{p}(y=1\mid x)\). To turn that into a label we pick a threshold \(t\):
\[
\hat{y} = \mathbf{1}[\hat{p}(y=1 \mid x) \ge t]
\]
As you vary \(t\), the predictions only change when \(t\) crosses one of the predicted probabilities. So the empirical 0-1 loss as a function of \(t\) is a step function (flat most of the time, then jumps).
This is a key reason 0-1 loss is not used as a smooth training objective: small parameter changes often produce no change in 0-1 loss until a point flips sides.
n = 250
x = rng.normal(size=n)
p_true = sigmoid(1.5 * x - 0.3)
y = rng.binomial(1, p_true)
# Pretend these are predicted probabilities from an imperfect model
p_hat = np.clip(p_true + 0.15 * rng.normal(size=n), 1e-3, 1 - 1e-3)
thresholds = np.linspace(0.0, 1.0, 601)
losses = np.array([zero_one_loss_from_proba(y, p_hat, threshold=t) for t in thresholds])
acc = 1.0 - losses
t_best, _ = best_threshold_zero_one(y, p_hat)
fig = make_subplots(specs=[[{"secondary_y": True}]])
fig.add_trace(
go.Scatter(
x=thresholds,
y=losses,
name="zero-one loss",
mode="lines",
line_shape="hv",
),
secondary_y=False,
)
fig.add_trace(
go.Scatter(
x=thresholds,
y=acc,
name="accuracy (1 - loss)",
mode="lines",
line_shape="hv",
),
secondary_y=True,
)
fig.add_vline(x=0.5, line_dash="dash", line_color="gray", opacity=0.7)
fig.add_vline(x=t_best, line_dash="dot", line_color="crimson")
fig.update_xaxes(title_text="threshold t")
fig.update_yaxes(title_text="0-1 loss", secondary_y=False, range=[0, 1])
fig.update_yaxes(title_text="accuracy", secondary_y=True, range=[0, 1])
fig.update_layout(
title=f"0-1 loss is a step function of the threshold (one optimal t ≈ {t_best:.3f})",
legend=dict(orientation="h", yanchor="bottom", y=1.02, xanchor="left", x=0),
)
fig.show()
3) NumPy implementation: sanity checks#
A key property: 0-1 loss is insensitive to confidence.
predicting 0.51 vs 0.99 for the positive class gives the same 0-1 outcome (as long as the thresholded label is the same)
but a probabilistic loss like log loss will strongly prefer 0.99 over 0.51 when the true label is 1
Let’s verify that our NumPy version matches scikit-learn and highlight the “confidence blindness”.
y_true = np.array([1, 0, 1, 1, 0, 0])
y_pred = np.array([1, 0, 0, 1, 0, 1])
print("numpy (mean):", zero_one_loss_np(y_true, y_pred))
print("sklearn (mean):", sk_zero_one_loss(y_true, y_pred))
print("1 - accuracy_score:", 1 - accuracy_score(y_true, y_pred))
print("numpy (count):", zero_one_loss_np(y_true, y_pred, normalize=False))
print("sklearn (count):", sk_zero_one_loss(y_true, y_pred, normalize=False))
w = np.array([1, 1, 5, 1, 1, 1], dtype=float)
print("\nweighted numpy (mean):", zero_one_loss_np(y_true, y_pred, sample_weight=w))
print("weighted sklearn (mean):", sk_zero_one_loss(y_true, y_pred, sample_weight=w))
# multilabel / multioutput: subset 0-1 loss (row must match exactly)
y_true_ml = np.array([[1, 0, 1], [1, 1, 0], [0, 0, 1]])
y_pred_ml = np.array([[1, 0, 1], [1, 0, 0], [0, 1, 1]])
print("\nmultilabel numpy:", zero_one_loss_np(y_true_ml, y_pred_ml))
print("multilabel sklearn:", sk_zero_one_loss(y_true_ml, y_pred_ml))
# confidence blindness: same hard predictions, different probabilities
y_true = np.array([1, 1, 1, 0, 0])
p_soft = np.array([0.51, 0.55, 0.52, 0.49, 0.45])
p_confident = np.array([0.99, 0.90, 0.80, 0.20, 0.01])
print("\n0-1 loss (soft):", zero_one_loss_from_proba(y_true, p_soft))
print("0-1 loss (confident):", zero_one_loss_from_proba(y_true, p_confident))
print("log loss (soft):", log_loss_binary(y_true, p_soft))
print("log loss (confident):", log_loss_binary(y_true, p_confident))
numpy (mean): 0.3333333333333333
sklearn (mean): 0.33333333333333337
1 - accuracy_score: 0.33333333333333337
numpy (count): 2.0
sklearn (count): 2.0
weighted numpy (mean): 0.6
weighted sklearn (mean): 0.6
multilabel numpy: 0.6666666666666666
multilabel sklearn: 0.6666666666666667
0-1 loss (soft): 0.0
0-1 loss (confident): 0.0
log loss (soft): 0.6392579150890872
log loss (confident): 0.11434965799864971
4) Using 0-1 loss for selection/optimization#
Because 0-1 loss is a step function in the threshold (and in the model parameters), it is typically used as a selection criterion rather than a differentiable training objective.
A very common and practical “optimization” task is threshold tuning:
\[
t^* = \arg\min_{t \in [0,1]} \; \frac{1}{n} \sum_{i=1}^{n} \mathbf{1}\big[y_i \neq \mathbf{1}[\hat{p}_i \ge t]\big]
\]
This works well because it is a 1D search (grid search or exact search over unique probabilities).
If you care more about one class (asymmetric costs), you can encode that with sample_weight (or with an explicit cost-sensitive threshold rule).
# Grid search threshold (approximate)
thresholds = np.linspace(0.0, 1.0, 2001)
losses_grid = np.array([zero_one_loss_from_proba(y, p_hat, threshold=t) for t in thresholds])
min_loss_grid = float(losses_grid.min())
min_idx = np.where(losses_grid == min_loss_grid)[0]
t_grid = float(thresholds[int(min_idx[0])])
t_grid_low = float(thresholds[int(min_idx[0])])
t_grid_high = float(thresholds[int(min_idx[-1])])
# Exact threshold search (evaluate unique p_hat values)
t_exact, loss_exact = best_threshold_zero_one(y, p_hat)
print(f"grid-search min loss: {min_loss_grid:.4f} (t in [{t_grid_low:.4f}, {t_grid_high:.4f}])")
print(f"exact-search min loss: {loss_exact:.4f} (one optimal t={t_exact:.4f})")
# Weighted: make mistakes on positives 3x more costly
w_pos = np.where(y == 1, 3.0, 1.0)
t_w, loss_w = best_threshold_zero_one(y, p_hat, sample_weight=w_pos)
print(f"weighted best t: {t_w:.4f} (loss={loss_w:.4f})")
losses_unweighted = np.array([zero_one_loss_from_proba(y, p_hat, threshold=t) for t in thresholds])
losses_weighted = np.array([zero_one_loss_from_proba(y, p_hat, threshold=t, sample_weight=w_pos) for t in thresholds])
fig = go.Figure()
fig.add_trace(go.Scatter(x=thresholds, y=losses_unweighted, mode="lines", line_shape="hv", name="unweighted"))
fig.add_trace(go.Scatter(x=thresholds, y=losses_weighted, mode="lines", line_shape="hv", name="weighted (pos×3)"))
fig.add_vline(x=t_exact, line_dash="dot", line_color="black")
fig.add_vline(x=t_w, line_dash="dot", line_color="crimson")
fig.update_layout(title="Threshold tuning for 0-1 loss (unweighted vs weighted)")
fig.update_xaxes(title_text="threshold t")
fig.update_yaxes(title_text="0-1 loss", range=[0, 1])
fig.show()
grid-search min loss: 0.2840 (t in [0.4675, 0.6995])
exact-search min loss: 0.2800 (one optimal t=0.6984)
weighted best t: 0.1358 (loss=0.2457)
4.1 Why 0-1 loss is hard to optimize directly (and what we do instead)#
If a classifier \(h_\theta\) depends on parameters \(\theta\) (e.g. linear model weights), the empirical 0-1 loss is:
\[
\bar{L}_{0\text{-}1}(\theta) = \frac{1}{n} \sum_{i=1}^{n} \mathbf{1}[y_i \neq h_\theta(x_i)]
\]
This function is:
discontinuous / non-differentiable (jumps when a point flips sides)
typically non-convex and full of plateaus
hard to minimize exactly for most hypothesis classes
So in practice we train with a surrogate loss that is smooth and easier to optimize (e.g. log loss / cross-entropy for logistic regression), and then evaluate with 0-1 loss.
The plots below compare the loss landscapes for a simple 1D logistic model.
n = 120
x = rng.normal(size=n)
x = (x - x.mean()) / x.std()
p_true = sigmoid(2.0 * x - 0.4)
y = rng.binomial(1, p_true)
w_grid = np.linspace(-6, 6, 151)
b_grid = np.linspace(-6, 6, 151)
Z = x[:, None, None] * w_grid[None, None, :] + b_grid[None, :, None]
P = sigmoid(Z)
y_pred = (P >= 0.5).astype(int)
loss01 = (y[:, None, None] != y_pred).mean(axis=0)
eps = 1e-12
P_clip = np.clip(P, eps, 1 - eps)
losslog = -(y[:, None, None] * np.log(P_clip) + (1 - y[:, None, None]) * np.log(1 - P_clip)).mean(axis=0)
# Gradient descent on log loss (same simple 1D model)
w = 0.0
b = 0.0
lr = 0.8
w_path = [w]
b_path = [b]
for _ in range(40):
z = w * x + b
p = sigmoid(z)
grad = p - y
grad_w = float(np.mean(grad * x))
grad_b = float(np.mean(grad))
w -= lr * grad_w
b -= lr * grad_b
w_path.append(w)
b_path.append(b)
fig = make_subplots(
rows=1,
cols=2,
subplot_titles=("0-1 loss (threshold=0.5)", "log loss (smooth surrogate)"),
horizontal_spacing=0.12,
)
fig.add_trace(
go.Heatmap(x=w_grid, y=b_grid, z=loss01, zmin=0, zmax=1, colorbar=dict(title="0-1")),
row=1,
col=1,
)
fig.add_trace(
go.Heatmap(x=w_grid, y=b_grid, z=losslog, colorbar=dict(title="log")),
row=1,
col=2,
)
fig.add_trace(go.Scatter(x=w_path, y=b_path, mode="lines+markers", name="GD path"), row=1, col=1)
fig.add_trace(go.Scatter(x=w_path, y=b_path, mode="lines+markers", showlegend=False), row=1, col=2)
fig.update_xaxes(title_text="w", row=1, col=1)
fig.update_xaxes(title_text="w", row=1, col=2)
fig.update_yaxes(title_text="b", row=1, col=1)
fig.update_yaxes(title_text="b", row=1, col=2)
fig.update_layout(title="0-1 loss is piecewise-constant; log loss provides a smooth optimization landscape")
fig.show()
4.2 Example: train logistic regression (from scratch), evaluate 0-1 loss#
We’ll fit a simple logistic regression model by minimizing log loss with gradient descent, while tracking 0-1 loss on train/validation.
Model:
\[
\hat{p}_i = \sigma(w^\top x_i + b), \qquad \sigma(z) = \frac{1}{1 + e^{-z}}
\]
Training objective (mean log loss):
\[
\mathcal{L}(w, b) = -\frac{1}{n} \sum_{i=1}^{n} \left[ y_i \log \hat{p}_i + (1 - y_i) \log(1 - \hat{p}_i) \right]
\]
Then we compute 0-1 loss by thresholding \(\hat{p}\) at \(t=0.5\) (and optionally tuning \(t\) on validation).
X, y = make_blobs(
n_samples=900,
centers=2,
n_features=2,
cluster_std=2.2,
random_state=0,
)
X_train, X_val, y_train, y_val = train_test_split(
X,
y,
test_size=0.3,
random_state=0,
stratify=y,
)
X_train_s, mean, std = standardize_fit_transform(X_train)
X_val_s = standardize_transform(X_val, mean, std)
w, b, hist = fit_logistic_regression_gd(
X_train_s,
y_train,
X_val=X_val_s,
y_val=y_val,
lr=0.2,
n_steps=250,
l2=0.01,
threshold=0.5,
)
p_val = sigmoid(X_val_s @ w + b)
val_loss_05 = zero_one_loss_from_proba(y_val, p_val, threshold=0.5)
t_best, val_loss_best = best_threshold_zero_one(y_val, p_val)
print(f"val 0-1 loss @ t=0.5: {val_loss_05:.4f}")
print(f"best val threshold: {t_best:.4f} (val 0-1 loss={val_loss_best:.4f})")
fig = make_subplots(specs=[[{"secondary_y": True}]])
fig.add_trace(go.Scatter(x=hist["step"], y=hist["train_log_loss"], name="train log loss"), secondary_y=False)
fig.add_trace(go.Scatter(x=hist["step"], y=hist["val_log_loss"], name="val log loss"), secondary_y=False)
fig.add_trace(
go.Scatter(x=hist["step"], y=hist["train_zero_one"], name="train 0-1 loss", line_shape="hv"),
secondary_y=True,
)
fig.add_trace(
go.Scatter(x=hist["step"], y=hist["val_zero_one"], name="val 0-1 loss", line_shape="hv"),
secondary_y=True,
)
fig.update_xaxes(title_text="gradient descent step")
fig.update_yaxes(title_text="log loss", secondary_y=False)
fig.update_yaxes(title_text="0-1 loss", secondary_y=True, range=[0, 1])
fig.update_layout(title="Training with log loss; tracking 0-1 loss (step-like)")
fig.show()
# Decision boundary visualization
x0_min, x0_max = X_train_s[:, 0].min() - 0.8, X_train_s[:, 0].max() + 0.8
x1_min, x1_max = X_train_s[:, 1].min() - 0.8, X_train_s[:, 1].max() + 0.8
x0 = np.linspace(x0_min, x0_max, 220)
x1 = np.linspace(x1_min, x1_max, 220)
xx0, xx1 = np.meshgrid(x0, x1)
grid = np.c_[xx0.ravel(), xx1.ravel()]
prob_grid = sigmoid(grid @ w + b).reshape(xx0.shape)
fig = go.Figure()
fig.add_trace(
go.Contour(
x=x0,
y=x1,
z=prob_grid,
contours=dict(start=0.0, end=1.0, size=0.1),
colorscale="RdBu",
opacity=0.85,
colorbar=dict(title="P(y=1)"),
showscale=True,
)
)
fig.add_trace(
go.Scatter(
x=X_train_s[:, 0],
y=X_train_s[:, 1],
mode="markers",
marker=dict(color=y_train, colorscale="Viridis", opacity=0.9, line=dict(width=0.2, color="black")),
name="train points",
)
)
fig.update_layout(title="Logistic regression probabilities (0-1 loss comes from thresholding)")
fig.update_xaxes(title_text="x0 (standardized)")
fig.update_yaxes(title_text="x1 (standardized)")
fig.show()
val 0-1 loss @ t=0.5: 0.2074
best val threshold: 0.4675 (val 0-1 loss=0.2000)
Pros / cons and when to use 0-1 loss#
Pros#
Highly interpretable: “error rate” (or # mistakes)
Threshold/decision-rule focused: directly measures what many applications care about (correct label)
Works for multiclass with no extra machinery
Aligns with the Bayes classifier under equal misclassification costs (argmax posterior)
Cons#
Non-differentiable / discontinuous → not suitable as a gradient-based training loss
Ignores confidence and calibration: 0.51 and 0.99 are treated the same after thresholding
Can be misleading under class imbalance (a majority-class classifier can look good)
Depends on the decision rule (threshold choice, argmax ties, cost-sensitive adjustments)
Multilabel subset 0-1 is very strict (one wrong label makes the whole example wrong)
When it’s a good choice#
Reporting final performance when all errors are equally costly
Comparing classifiers after you have a clear, fixed threshold / decision policy
Hyperparameter selection when you truly care about accuracy/error rate (using a validation set)
Common pitfalls + diagnostics#
Class imbalance: 0-1 loss/accuracy may hide poor minority performance. Also inspect the confusion matrix; consider balanced accuracy, F1, PR AUC.
Wrong threshold: if your positive class is rare or costs are asymmetric, \(t=0.5\) is often not optimal; tune \(t\) or use cost-sensitive decision rules.
Multilabel strictness: subset 0-1 can be too harsh; consider Hamming loss, Jaccard score, or per-label F1.
Probability quality not measured: two models can have the same 0-1 loss but very different calibration; also report log loss / Brier score if probabilities matter.
Test-set threshold tuning: choose thresholds/hyperparameters on validation (or via cross-validation), not on the test set.
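The class-imbalance pitfall is worth seeing in numbers. A minimal sketch (synthetic data, assumed ~5% positive rate): a classifier that always predicts the majority class achieves a low 0-1 loss while detecting no positives at all.

```python
import numpy as np

# Class-imbalance pitfall (sketch): a majority-class classifier looks "good"
# under 0-1 loss while being useless as a detector.
rng = np.random.default_rng(0)
y = (rng.random(1000) < 0.05).astype(int)  # ~5% positives

y_pred_majority = np.zeros_like(y)    # always predict class 0
loss = (y != y_pred_majority).mean()  # equals the positive rate
recall_pos = (y_pred_majority[y == 1] == 1).mean() if (y == 1).any() else 0.0

print(f"0-1 loss: {loss:.3f}")               # small -> looks good
print(f"positive recall: {recall_pos:.3f}")  # 0.000 -> catches nothing
```

This is why the confusion matrix (or balanced accuracy, F1, PR AUC) should accompany 0-1 loss whenever classes are imbalanced.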
Exercises#
Prove that normalized 0-1 loss is exactly \(1-\text{accuracy}\).
Derive the cost-sensitive threshold \(\eta(x)\ge \frac{c_{01}}{c_{01}+c_{10}}\) from expected cost minimization.
Construct two classifiers with the same 0-1 loss but very different log loss. When would you prefer each?
Extend best_threshold_zero_one to return all thresholds achieving the minimum.
For multilabel data, compare subset 0-1 loss vs Hamming loss on a synthetic example and interpret the difference.
References#
scikit-learn zero_one_loss: https://scikit-learn.org/stable/api/generated/sklearn.metrics.zero_one_loss.html
scikit-learn accuracy_score: https://scikit-learn.org/stable/api/generated/sklearn.metrics.accuracy_score.html
Hastie, Tibshirani, Friedman: The Elements of Statistical Learning, Ch. 2 (classification), Ch. 4 (linear methods for classification)